
fix: Summit cold start from checkpoint deadlock #131

Merged
HenryMBaldwin merged 14 commits into h/staking-and-joining from h/fix-summit-checkpoint-deadlock
Mar 3, 2026

@HenryMBaldwin (Contributor) commented Feb 26, 2026

Issue

When starting a wiped node from a summit checkpoint where the execution client has clean state, summit deadlocks (irrecoverably AFAIK) because it treats SYNCING from the execution client as a failure.

Solution

This is solved with an initial syncing phase, prompted by sending an initial forkchoice update to the execution client to set the sync target, and polling until it's done syncing.
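A minimal sketch of the initial syncing phase described above, under stated assumptions: `EngineClient`, `forkchoice_updated`, `wait_for_initial_sync`, and `ScriptedClient` are hypothetical names, the real client is async rather than blocking, and polling is modeled here by re-sending the same forkchoice update, which may differ from how the PR polls.

```rust
use std::collections::VecDeque;
use std::thread::sleep;
use std::time::Duration;

/// Simplified stand-in for the status in a forkchoice update response.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Valid,
    Syncing,
    Invalid,
}

/// Hypothetical blocking client interface (assumption: the real one is async).
trait EngineClient {
    fn forkchoice_updated(&mut self, head: [u8; 32]) -> Status;
}

/// Re-sends the forkchoice update until the client stops reporting SYNCING.
/// Returns how many SYNCING responses were seen before VALID, or an error
/// if the client answers INVALID.
fn wait_for_initial_sync<C: EngineClient>(
    client: &mut C,
    head: [u8; 32],
    poll_interval: Duration,
) -> Result<u32, String> {
    let mut syncing_responses = 0;
    loop {
        match client.forkchoice_updated(head) {
            Status::Valid => return Ok(syncing_responses),
            Status::Syncing => {
                syncing_responses += 1;
                sleep(poll_interval);
            }
            Status::Invalid => {
                return Err("execution client rejected the sync target".to_string())
            }
        }
    }
}

/// Test double that replays scripted responses, then reports VALID.
struct ScriptedClient {
    responses: VecDeque<Status>,
}

impl EngineClient for ScriptedClient {
    fn forkchoice_updated(&mut self, _head: [u8; 32]) -> Status {
        self.responses.pop_front().unwrap_or(Status::Valid)
    }
}
```

The loop has no timeout, so a client that never finishes syncing keeps the caller waiting indefinitely.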

We also attempt to sync in execute block if the execution client returns SYNCING. This situation means something has likely gone catastrophically wrong and should never happen with our reth client, so we log an error instead of a warning while waiting for the execution client to recover.
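The retry behavior in execute block might look roughly like the following. This is only a sketch: `check_payload_until_synced` and the closure-based interface are hypothetical, `eprintln!` stands in for an error-level log call, and the real code is async.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Simplified stand-in for the engine API payload status.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PayloadStatus {
    Valid,
    Invalid,
    Syncing,
}

/// Calls `check_payload` until it returns something other than SYNCING.
/// Every SYNCING response is reported at error level (stderr here), since it
/// is not expected once the initial sync has completed.
/// Returns the final status and the number of retries.
fn check_payload_until_synced<F>(
    mut check_payload: F,
    height: u64,
    retry_interval: Duration,
) -> (PayloadStatus, u32)
where
    F: FnMut() -> PayloadStatus,
{
    let mut retries = 0;
    loop {
        match check_payload() {
            PayloadStatus::Syncing => {
                eprintln!("ERROR height={height}: execution client returned SYNCING, retrying");
                retries += 1;
                sleep(retry_interval);
            }
            status => return (status, retries),
        }
    }
}
```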

Notably, if the execution client doesn't have any peers or is unable to sync for any reason, this also effectively stalls the node. IMO this is still a strict improvement over the previous behavior.

@daltoncoder (Collaborator) left a comment

Nice, looks good. Just some minor comments.

Comment on lines +242 to +248
    } else {
        warn!(
            ?status,
            "unexpected response to initial forkchoice update, proceeding anyway"
        );
        break;
    }
Collaborator

The only other status not covered in the other branches is Invalid

    /// INVALID is returned by the engine API in the following calls:
    ///   - forkchoiceUpdate: if the new head is unknown, pre-merge, or reorg to it fails

Since this happens on startup, it is going to be unrecoverable for the node, which will crash as soon as it gets a block. Let's just panic here with a helpful message, something like "Finalizer started with invalid forkchoice".
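A rough sketch of what this suggestion amounts to, assuming a simplified `Status` enum and hypothetical `Startup` and `on_initial_forkchoice_response` names. The panic message is the one proposed in the comment, not necessarily the one that was merged.

```rust
/// Simplified stand-in for the status in a forkchoice update response.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Valid,
    Syncing,
    Invalid,
}

/// What the startup loop should do next (hypothetical type).
#[derive(Debug, PartialEq)]
enum Startup {
    Proceed,
    KeepPolling,
}

/// Maps the response to the initial forkchoice update onto the next step.
/// INVALID cannot be recovered from at startup, so fail fast with a clear message.
fn on_initial_forkchoice_response(status: Status) -> Startup {
    match status {
        Status::Valid => Startup::Proceed,
        Status::Syncing => Startup::KeepPolling,
        Status::Invalid => panic!("Finalizer started with invalid forkchoice"),
    }
}
```

With every status handled explicitly, there is no catch-all branch left to warn and proceed.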

Contributor Author

done

    height = block.height(),
    "execution client returned SYNCING, sending forkchoice update to trigger sync and retrying..."
    );
    engine_client.commit_hash(state.forkchoice).await;
Collaborator

We discussed this in the office, but this commit is not needed and we should remove it.

Contributor Author

removed

Comment on lines +1176 to +1181
    (true, false) => {
        warn!(
            new_height,
            "payload valid but parent hash mismatch, not executing"
        );
    }
Collaborator

As far as I can tell, this branch is unreachable. Waiting for @matthias-wright to also check this out, but if that is the case, this whole thing can be simplified to just check payload_status.is_valid().
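To illustrate the proposed simplification, here is a sketch that assumes the match is over a (payload valid, parent hash matches) pair, which is inferred from the warning text; only the `(true, false)` arm appears in the diff, and the function names are hypothetical.

```rust
/// Branching in the style of the reviewed snippet.
/// Assumption: the tuple is (payload valid, parent hash matches).
fn should_execute_tuple(payload_valid: bool, parent_matches: bool) -> bool {
    match (payload_valid, parent_matches) {
        (true, true) => true,
        // "payload valid but parent hash mismatch, not executing"
        (true, false) => false,
        (false, _) => false,
    }
}

/// Simplified form: if a valid payload always has a matching parent hash,
/// the validity check alone decides whether to execute.
fn should_execute_simple(payload_valid: bool) -> bool {
    payload_valid
}
```

The two functions agree on every input except `(true, false)`, so the simplification is safe only if that combination really cannot occur.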

Comment on lines +43 to +45
    // Response override queues
    check_payload_overrides: VecDeque<PayloadStatus>,
    commit_hash_overrides: VecDeque<ForkchoiceUpdated>,
Collaborator

Cool, nice solution for fixing up the MockEngineClient.
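For readers unfamiliar with the pattern, a response override queue in a mock client could work roughly as follows. Only the field name `check_payload_overrides` comes from the diff; the simplified `PayloadStatus` enum, the method names, and the VALID default are assumptions for illustration.

```rust
use std::collections::VecDeque;

/// Simplified stand-in for the engine API payload status.
#[derive(Debug, Clone, PartialEq)]
enum PayloadStatus {
    Valid,
    Syncing,
    Invalid,
}

/// Mock client whose responses can be scripted per call.
#[derive(Default)]
struct MockEngineClient {
    // Queued responses, consumed in FIFO order, one per call.
    check_payload_overrides: VecDeque<PayloadStatus>,
}

impl MockEngineClient {
    /// Queues a response for a future `check_payload` call (hypothetical helper).
    fn push_check_payload_override(&mut self, status: PayloadStatus) {
        self.check_payload_overrides.push_back(status);
    }

    /// Returns the next queued override, or VALID when the queue is empty
    /// (assumed default behavior).
    fn check_payload(&mut self) -> PayloadStatus {
        self.check_payload_overrides
            .pop_front()
            .unwrap_or(PayloadStatus::Valid)
    }
}
```

A test can queue `Syncing` once and then observe the mock fall back to its default on the next call, which is enough to exercise a retry path.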

    pub const BLOCKS_PER_EPOCH: u64 = 10;
    #[cfg(all(not(debug_assertions), not(feature = "e2e")))]
    const BLOCKS_PER_EPOCH: u64 = 10000;
    const BLOCKS_PER_EPOCH: u64 = 50;
Collaborator

Let's revert this until the PR that adds this to the genesis file lands. It will break some test binaries we have.

Contributor Author

reverted

@HenryMBaldwin merged commit a9feea4 into h/staking-and-joining Mar 3, 2026
@HenryMBaldwin deleted the h/fix-summit-checkpoint-deadlock branch March 3, 2026 19:03